Reward highway network based global credit assignment algorithm in multi-agent reinforcement learning
YAO Xinghu, TAN Xiaoyang
Journal of Computer Applications    2021, 41 (1): 1-7.   DOI: 10.11772/j.issn.1001-9081.2020061009
Abstract
To address the exponential growth of the joint action space as the number of agents increases in multi-agent systems, the "centralized training with decentralized execution" framework was adopted to mitigate the curse of dimensionality of the joint action space and reduce the optimization cost of the algorithm. A new global credit assignment mechanism, Reward HighWay Network (RHWNet), was proposed for the situation, common in multi-agent reinforcement learning scenarios, where the environment provides only a global reward corresponding to the joint behavior of all agents. By introducing a reward highway connection into the global reward assignment mechanism of the original algorithm, the value function of each agent was directly connected with the global reward, so that each agent was able to consider both the global reward signal and its own actual reward value when selecting its policy. Firstly, in the training process, the agents were coordinated through a centralized value function structure, and this centralized structure also served to assign the global reward. Then, the reward highway connection was introduced into the centralized value function structure to assist the global reward assignment, thus establishing the reward highway network. Finally, in the execution phase, each agent's policy depended only on its own value function. Experimental results on the StarCraft Multi-Agent Challenge (SMAC) micromanagement scenarios show that the proposed reward highway network achieves an improvement of more than 20% in test win rate on four complex maps compared to the state-of-the-art Counterfactual Multi-Agent policy gradient (COMA) and QMIX algorithms. More importantly, on the 3s5z and 3s6z scenarios, which contain a large number of agents of different types, the proposed network achieves better results while requiring only 30% of the samples needed by algorithms such as COMA and QMIX.
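To make the described architecture concrete, the following is a minimal sketch of a centralized value-function mixer with a highway-style skip connection that keeps each agent's utility directly tied to the globally rewarded joint value. It follows a QMIX-style hypernetwork mixer only because the abstract compares against QMIX; the class name HighwayMixer, the gating mechanism, and all hyperparameters are illustrative assumptions, not the paper's actual design or settings.

import torch
import torch.nn as nn

class HighwayMixer(nn.Module):
    """Illustrative mixer: per-agent utilities are combined into a joint value
    (trained against the global reward), with a gated highway path that lets
    the summed agent utilities bypass the mixing network. This is a sketch,
    not the RHWNet implementation from the paper."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetwork producing non-negative mixing weights from the global state.
        self.hyper_w = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b = nn.Linear(state_dim, embed_dim)
        self.out = nn.Linear(embed_dim, 1)
        # Gate controlling how much of the plain sum of per-agent utilities
        # flows directly into the joint value (the "highway" connection).
        self.highway_gate = nn.Sequential(nn.Linear(state_dim, 1), nn.Sigmoid())

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) chosen-action utilities; state: (batch, state_dim)
        batch = agent_qs.size(0)
        w = torch.abs(self.hyper_w(state)).view(batch, self.n_agents, -1)
        b = self.hyper_b(state).unsqueeze(1)
        mixed = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w) + b)  # (batch, 1, embed_dim)
        q_mix = self.out(mixed).squeeze(1)                           # (batch, 1)
        # Highway path: direct contribution of the summed agent utilities,
        # keeping each agent's value function close to the global reward signal.
        g = self.highway_gate(state)                                 # (batch, 1)
        q_highway = agent_qs.sum(dim=1, keepdim=True)
        return g * q_highway + (1.0 - g) * q_mix

In such a setup the joint output would be trained with a standard TD loss against the environment's global reward, while at execution time each agent acts greedily on its own utility network alone, matching the centralized-training, decentralized-execution scheme described above.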